Open RStudio to do the practicals. Note that tasks with * are optional.
In this practical, a number of R packages are used. The packages used (with versions that were used to generate the solutions) are:
survival (version: 3.3.1)
memisc (version: 0.99.30.7)
ggplot2 (version: 3.3.6)
R version 4.2.1 (2022-06-23 ucrt)
For this practical, we will use the heart and retinopathy data sets from the survival package. More details about the data sets can be found in:
https://stat.ethz.ch/R-manual/R-devel/library/survival/html/heart.html
https://stat.ethz.ch/R-manual/R-devel/library/survival/html/retinopathy.html
Before starting with any statistical analysis it is important to transform and explore your data set.
Using the heart data set:
The variable age is equal to age - 48. Let’s bring age back to the normal scale. Do not overwrite the variable age, but create a new variable with the name age_orig.
Convert the variable surgery into a factor with levels 0: no and 1: yes.
Use the function factor(…) to convert a numeric variable to a factor.
library(survival)  # provides the heart and retinopathy data sets
heart$age_orig <- heart$age + 48
heart$surgery <- factor(heart$surgery, levels = c(0, 1), labels = c("no", "yes"))
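As a small illustration of how factor(…) maps numeric codes to labels, here is a minimal sketch on a made-up 0/1 vector. The object names x and x_fac and the values are hypothetical and are not taken from the heart data set.

```r
# Toy 0/1 indicator (hypothetical values, not taken from the heart data)
x <- c(0, 1, 1, 0, 1)

# Map the codes 0 and 1 to the labels "no" and "yes"
x_fac <- factor(x, levels = c(0, 1), labels = c("no", "yes"))

levels(x_fac)  # the labels, in the order given by `levels`
table(x_fac)   # number of observations per category
```

The order of labels must follow the order of levels; swapping one without the other would silently relabel the categories.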
Categorize the variable age from the retinopathy data set as young: [minimum age until mean age) and old: [mean age until maximum age]. Give this variable the name ageCat. Print the first 6 rows of the data set retinopathy.
To dichotomize a numeric variable combine the function as.numeric(…) with a logical condition (e.g., as.numeric(X > 2)). This logical condition will split the numeric variable into two parts (young and old). Use the function factor(…) to convert a variable into a factor.
retinopathy$ageCat <- as.numeric(retinopathy$age >= mean(retinopathy$age))
retinopathy$ageCat <- factor(retinopathy$ageCat, levels = c(0, 1), labels = c("young", "old"))
head(retinopathy)
Categorize futime from data set retinopathy as follows:
short: [minimum futime until 25).
medium: [25 until 45).
long: [45 until maximum futime].
Give this variable the name futimeCut. Print the first 6 rows of the data.
Create a variable that is identical to the futime variable (use the name futimeCut). Then use indexing (e.g., X[X < 25]) to select the correct subset of the new variable futimeCut and set it to the new category (e.g., “short”).
E.g. you can create the short category as:
retinopathy$futimeCut <- retinopathy$futime
retinopathy$futimeCut[retinopathy$futime < 25] <- "short"
Now continue with the other categories.
retinopathy$futimeCut <- retinopathy$futime
retinopathy$futimeCut[retinopathy$futime < 25] <- "short"
retinopathy$futimeCut[retinopathy$futime >= 25 & retinopathy$futime < 45] <- "medium"
retinopathy$futimeCut[retinopathy$futime >= 45] <- "long"
head(retinopathy)
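The same three categories can also be produced in one step with the base R function cut(…). This is an alternative sketch, not the solution used in this practical; the vector futime_toy and its values are hypothetical.

```r
# Toy follow-up times (hypothetical values, not taken from retinopathy)
futime_toy <- c(5, 25, 44.9, 45, 70)

# right = FALSE makes the intervals closed on the left: [a, b)
futime_cat <- cut(futime_toy,
                  breaks = c(-Inf, 25, 45, Inf),
                  labels = c("short", "medium", "long"),
                  right = FALSE)

# Show each value next to its category
data.frame(futime_toy, futime_cat)
```

Note that cut(…) returns a factor, whereas the indexing approach above produces a character vector.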
Create 2 vectors of size 50 as follows:
Sex: takes 2 values 0 and 1.
Age: takes values from 20 till 80.
Convert the Sex variable into a factor with levels 0: female and 1: male.
Create the variable AgeCat as dichotomous with Age <= 50 to be 0 and 1 otherwise.
Convert the AgeCat variable into a factor with levels 0: young and 1: old.
Standardize the Age variable by \(\frac{Age-mean(Age)}{sd(Age)}\).
To sample a numeric and categorical variable use the function sample(…). To convert a numeric variable to a categorical use the function factor(…). To dichotomize a numeric variable use the function as.numeric(…).
Sex <- sample(0:1, 50, replace = TRUE)
Age <- sample(20:80, 50, replace = TRUE)
Sex <- factor(Sex, levels = c(0:1), labels = c("female", "male"))
AgeCat <- as.numeric(Age > 50)
AgeCat <- factor(AgeCat, levels = c(0:1), labels = c("young", "old"))
Age <- (Age - mean(Age))/sd(Age)
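To see that the manual standardization behaves as expected, the following sketch compares it with the base R function scale(…) on simulated data. The seed and the object names age_toy, z_manual and z_scale are assumptions made for illustration only.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
age_toy <- sample(20:80, 50, replace = TRUE)

# Manual standardization, as in the task
z_manual <- (age_toy - mean(age_toy)) / sd(age_toy)

# scale() centers by the mean and divides by the standard deviation;
# as.numeric() drops the matrix structure it returns
z_scale <- as.numeric(scale(age_toy))

all.equal(z_manual, z_scale)  # both approaches should agree
mean(z_manual)                # numerically zero
sd(z_manual)                  # one
```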
Create a data frame with the name DF as follows:
Include the variables Sex, Age, AgeCat from the previous Task.
Change the names of the variables to Gender, StandardizedAge, DichotomousAge.
DF <- data.frame(Sex, Age, AgeCat)
DF <- data.frame("Gender" = Sex, "StandardizedAge" = Age, "DichotomousAge" = AgeCat)
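Instead of rebuilding the data frame, the columns of an existing one can be renamed with names(…). A minimal sketch on a hypothetical two-row data frame called toy_df:

```r
# Small hypothetical data frame (values are made up)
toy_df <- data.frame(Sex = c("female", "male"), Age = c(0.5, -0.5))

# Replace all column names at once; the order must match the columns
names(toy_df) <- c("Gender", "StandardizedAge")

toy_df
```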
Create 2 vectors of size 150 as follows:
Treatment: takes 2 values 1 and 2.
Weight: takes values from 50 till 100.
Convert the Treatment variable into a factor with levels 1: no and 2: yes.
Transform the Weight variable by Weight * 1000.
Create a data frame with the variables Treatment and Weight.
To sample a numeric and categorical variable use the function sample(…). To convert a numeric variable to a categorical use the function factor(…).
Treatment <- sample(1:2, 150, replace = TRUE)
Weight <- sample(50:100, 150, replace = TRUE)
Treatment <- factor(Treatment, levels = c(1:2), labels = c("no", "yes"))
Weight <- Weight * 1000
data.frame(Treatment, Weight)
Create a list called my_list with the following:
let: a to i.
sex: factor taking the values males and females and length 50.
mat: matrix
1 | 2
3 | 4
To obtain letters use the built-in vector letters (e.g., letters[1:3]). To sample a numeric and categorical variable use the function sample(…). To convert a numeric variable to a categorical use the function factor(…).
let <- letters[1:9]
sex <- sample(1:2, 50, replace = TRUE)
sex <- factor(sex, levels = 1:2, labels = c("males", "females"))
mat <- matrix(1:4, 2, 2, byrow = TRUE)
my_list <- list(let = let, sex = sex, mat = mat)
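Once a list exists, its elements can be extracted by name with $ or [[ ]]. The sketch below uses a separate, hypothetical list called toy_list so that it does not overwrite my_list.

```r
# Hypothetical list with the same structure as two elements of my_list
toy_list <- list(let = letters[1:9],
                 mat = matrix(1:4, 2, 2, byrow = TRUE))

names(toy_list)           # names of the list elements
toy_list$let[3]           # third letter: "c"
toy_list[["mat"]][1, 2]   # row 1, column 2 of the matrix: 2
```

With byrow = TRUE the matrix is filled row by row, so the first row is 1, 2 and the second row is 3, 4.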
Let’s obtain some descriptive statistics.
Obtain the mean and standard deviation for the variable age using the heart data set.
Use the functions mean(…) and sd(…).
mean(heart$age)
## [1] -2.484027
sd(heart$age)
## [1] 9.419999
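Using the retinopathy data set:
Obtain the median and interquartile range for the variable age.
Obtain the percentages for the variable type.
Check whether there are missing values in the variable age.
Use the functions median(…) and IQR(…) to obtain the median and the interquartile range. Load the package memisc and use the function percent(…) in order to obtain the percentages. To check whether there are missing values use the functions sum(is.na(…)).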
median(retinopathy$age)
## [1] 16
IQR(retinopathy$age)
## [1] 20
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'memisc'
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
percent(retinopathy$type)
## juvenile adult N
## 57.86802 42.13198 394.00000
sum(is.na(retinopathy$age)) # any(is.na(retinopathy$age))
## [1] 0
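Percentages can also be computed without memisc by combining the base R functions table(…) and prop.table(…). This is an alternative sketch on a made-up vector; type_toy and pct are hypothetical names.

```r
# Toy categorical variable (hypothetical values)
type_toy <- c("juvenile", "adult", "juvenile", "juvenile")

# table() counts each category, prop.table() turns counts into proportions
pct <- prop.table(table(type_toy)) * 100

pct
```

Unlike percent(…), this does not append the total number of observations; length(type_toy) gives that separately.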
Using the data frame DF from the exercise before (Task 5):
Obtain the mean of StandardizedAge.
Obtain the standard deviation of StandardizedAge.
Obtain the frequencies of Gender.
Obtain the frequencies of DichotomousAge.
Obtain the frequencies of Gender and DichotomousAge (crosstab table).
To calculate the frequencies, use the functions length(…) or table(…). To obtain the dimensions use the function dim(…).
mean(DF$StandardizedAge)
## [1] 1.891608e-16
sd(DF$StandardizedAge)
## [1] 1
length(DF$Gender[DF$Gender == "female"])
## [1] 22
length(DF$Gender[DF$Gender == "male"])
## [1] 28
table(DF$Gender)
##
## female male
## 22 28
table(DF$Gender, DF$DichotomousAge)
##
## young old
## female 15 7
## male 16 12
dim(DF)
## [1] 50 3
Obtain the Pearson and Spearman correlation of the variables year and age of the heart data set.
To calculate the correlations, use the function cor(…) and check the argument method.
cor(heart$year, heart$age, method = "pearson")
## [1] -0.1623965
cor(heart$year, heart$age, method = "spearman")
## [1] -0.1770664
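The two methods differ in what they measure: the Spearman correlation is the Pearson correlation computed on the ranks of the data. A minimal sketch on simulated data illustrates this; the seed and the variables x_toy and y_toy are assumptions for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible
x_toy <- rnorm(30)
y_toy <- x_toy + rnorm(30)

rho_spearman <- cor(x_toy, y_toy, method = "spearman")

# Pearson correlation of the ranks gives the same value
rho_ranks <- cor(rank(x_toy), rank(y_toy), method = "pearson")

all.equal(rho_spearman, rho_ranks)
```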
Let’s visualize the data.
Using the heart data set:
Create a scatterplot of the variables age and year.
Add the labels Age for the x-axis and Year of acceptance for the y-axis.
Use the function plot(…, xlab, ylab, col). Use the function legend(…) to add a legend to the plot.
plot(heart$age, heart$year)
plot(heart$age, heart$year, xlab = "Age", ylab = "Year of acceptance")
plot(heart$age, heart$year, xlab = "Age", ylab = "Year of acceptance", col = heart$transplant)
legend(-40, 6, c("no", "yes"), col = c("black", "red"), pch = 1)
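When col is given a factor, R uses the underlying integer codes to pick colours from the current palette. A more explicit variant is to build the colour vector by hand, so that the legend is guaranteed to match the points. The sketch below uses made-up data; grp, cols, x_pts and y_pts are hypothetical names.

```r
# Toy grouping factor and coordinates (hypothetical values)
grp <- factor(c(0, 1, 1, 0), levels = c(0, 1), labels = c("no", "yes"))
x_pts <- c(1, 2, 3, 4)
y_pts <- c(2, 1, 4, 3)

# One colour per level, looked up through the integer codes of the factor
group_cols <- c("black", "red")
cols <- group_cols[as.integer(grp)]

plot(x_pts, y_pts, col = cols, xlab = "x", ylab = "y")
# A keyword position avoids hard-coding coordinates for the legend
legend("topleft", legend = levels(grp), col = group_cols, pch = 1)
```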
Using the retinopathy data set:
Create a boxplot of the variable age per status.
Use the function boxplot(…).
boxplot(retinopathy$age ~ retinopathy$status)
boxplot(retinopathy$age ~ retinopathy$status, col = c("blue", "green"))
Using the retinopathy data set:
Create a smoothed curve of age with risk.
Create a density plot of age per type group.
Use the ggplot2 package and the functions: geom_smooth(…) and geom_density(…).
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:memisc':
##
## syms
ggplot(retinopathy, aes(age, risk)) +
  geom_smooth(colour = 'black', span = 0.4)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
ggplot(retinopathy, aes(age, fill = type)) +
  geom_density(alpha = 0.25)
© Eleni-Rosalina Andrinopoulou